Large Scale Multilingual Broadcast Data Collection to Support Machine Translation and Distillation Technology Development
Abstract
The development of technologies to address machine translation and distillation of multilingual broadcast data depends heavily on the collection of large volumes of material from modern data providers. To address the needs of GALE researchers, the Linguistic Data Consortium (LDC) developed a system for collecting broadcast news and conversation from a variety of Arabic, Chinese and English broadcasters. The system is highly automated, easily extensible and robust, and is capable of collecting, processing and evaluating hundreds of hours of content from several dozen sources per day. In addition to this extensive system, LDC manages three remote collection sites to maximize the variety of available broadcast data and has designed a portable broadcast collection platform to facilitate remote collection. This paper presents a detailed description of the design and implementation of LDC’s collection system, the technical challenges and solutions to large scale broadcast data collection efforts and an overview of the system’s operation. It also discusses the challenges of managing remote collections, in particular the strategies used to normalize data formats, naming conventions and delivery methods to achieve optimal integration of remotely-collected data into LDC’s collection database and downstream tasking workflow.

1. LDC’s Local Collection

The GALE program [1] established annual collection goals of 1000 hours each of broadcast news (BN) and broadcast conversation (BC) in each of Arabic, Chinese and English. At the commencement of GALE, LDC was already collecting recordings (mostly in the BN genre) in the target languages for various projects, including the EARS program and the 2004-2005 TRECVID programs [2]. The technical challenges related to expanding the collection for GALE included integrating various collection modalities (satellite systems, satellite dishes, receivers); managing multiple audio/video streams as they are collected; routing those streams to their assigned system location; scheduling programs to begin and end as directed; accounting for simultaneous broadcasts from different sources; managing the recordings’ processing rate; and making recordings promptly available for downstream tasks (including auditing, automatic speech recognition (ASR) and machine translation output and data selection).

[1] “GALE” refers to the program sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) whose full name is Global Autonomous Language Exploitation. As of the writing of this paper, the program is in Phase 4 (Year 4). This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-003. The content of this paper does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

[2] The EARS (Effective, Affordable, Reusable Speech-to-Text) program was sponsored by DARPA and was conducted from 2003 through 2005. TRECVID is the video retrieval evaluation of the TREC (Text REtrieval Conference) program sponsored by the US National Institute of Standards and Technology (NIST).

LDC’s pre-GALE broadcast collection consisted of the following sources: Aljazeera and Lebanese Broadcasting Corp. (LBC) for Arabic; China Central TV (CCTV), New Tang Dynasty TV (NTDTV) and Phoenix TV for Chinese; NBC/MSNBC and CNN for English; and Voice of America (VOA) for Arabic (via Radio Sawa), Chinese and English. For GALE Phase 1, LDC quickly determined that additional sources and hours could be accommodated in the existing collection infrastructure through cable sources.
This was necessary because there were a limited number of receivers in the existing collection, and those receivers were deployed for cable sources. Additional programming from Aljazeera and a new source, Al Arabiya, were added for Arabic, focusing mainly on the BC genre. For Chinese, new BC programs for CCTV, NTDTV and Phoenix were selected.

LDC added additional receivers to increase the range of its Arabic collection in Phase 2, responding to sponsor requests for greater representation of programming across the Arabic-speaking region, particularly the Gulf region and Iraq. A number of Arabic sources were available from free-to-air (FTA) satellites transmitting over the Philadelphia area. Because FTA sources, unlike cable sources, do not typically maintain scheduling information, LDC designed program surveys of the various sources that ran roughly twice per hour for several days. The collection manager, assisted by native speakers, reviewed the survey recordings and established a recording schedule. The new Arabic collection boasted thirteen sources with broad coverage including Iraq (Al Iraqiyah), various Gulf states (Abu Dhabi TV, Oman TV, Saudi TV), Iran (Al Alam News Channel) and Syria (Syria TV).

In early Phase 4, LDC added several regional Chinese sources in response to sponsor requests for an increased variety of that material. LDC upgraded its DISH Network Chinese subscription to access more regional programming and added the necessary equipment to commence that collection. New sources included Beijing TV, Dragon TV, Fujian TV, Guangdong TV, Hunan TV and Jiangsu TV. As of the writing of this paper in late Phase 4, LDC is enhancing its Arabic broadcast conversation collection at the sponsor’s request, targeting programs containing Iraqi and North African Arabic dialects.

Throughout the GALE program, LDC has continued to refine and normalize the local collection schedule.
LDC currently collects approximately 205 hours/week of programming from 27 broadcast sources, which break down by language as follows: Arabic (15 sources, 115 hours/week), Chinese (9 sources, 63 hours/week) and English (3 sources, 27 hours/week). The combined local and outsourced collection generates approximately 335 hours of programming weekly from 41 broadcast sources.

2. Broadcast Collection System Design and Operation

Part of the design intent driving the development of LDC’s broadcast collection system was that it be modular and regularized. That meant that all of the recording nodes should be interchangeable, that filenames and database fields should follow consistent, formal rules and that signal interconnects should be consistent. The receivers feed into an audio/video (A/V) matrix switch so that any source can be routed to any recorder simply by changing an entry in the schedule. The audio/video streams are first digitized as a DV25 [3] video stream plus stereo audio (the common denominator), which can then be used to derive whatever container and compression method is required for a given project.

The broadcast material is served to the system by a set of FTA satellite receivers, commercial direct satellite systems (DSS) such as DirecTV, direct broadcast satellite (DBS) receivers, and cable television (CATV) feeds. The mapping between receivers and recorders is dynamic and modular; all signal routing is performed under computer control, using a 256x64 A/V matrix switch. Programs are recorded in a high bandwidth A/V format and are then processed to extract audio, to generate key frames and compressed audio/video, to produce time-synchronized closed captions (in the case of North American English) and to generate ASR output. The recordings and their extracted content are stored in a high performance and highly reliable storage solution which is accessible to LDC teams for auditing, data selection and annotation.
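The routing idea described in this section can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name `MatrixSwitch`, its method names, the zero-based port numbering and the fan-out behaviour are hypothetical and are not taken from LDC’s system; only the 256x64 dimensions come from the text.

```python
class MatrixSwitch:
    """Toy model of an A/V matrix switch; it tracks routes and controls no hardware."""

    def __init__(self, n_inputs=256, n_outputs=64):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        # output port (recorder side) -> input port (receiver side)
        self._routes = {}

    def route(self, input_port, output_port):
        """Connect one input to one output, replacing any earlier route to that output."""
        if not 0 <= input_port < self.n_inputs:
            raise ValueError(f"input port {input_port} out of range")
        if not 0 <= output_port < self.n_outputs:
            raise ValueError(f"output port {output_port} out of range")
        self._routes[output_port] = input_port

    def source_for(self, output_port):
        """Return the input currently feeding an output, or None if unrouted."""
        return self._routes.get(output_port)

    def outputs_for(self, input_port):
        """Return every output fed by an input (fan-out is an assumption of this model)."""
        return sorted(o for o, i in self._routes.items() if i == input_port)


switch = MatrixSwitch()
switch.route(input_port=12, output_port=0)  # hypothetical receiver 12 -> recorder 0
switch.route(input_port=12, output_port=5)  # same receiver also feeds recorder 5
switch.route(input_port=40, output_port=0)  # a schedule change re-points recorder 0
print(switch.source_for(0), switch.outputs_for(12))
```

Keying the table by output port reflects the constraint that a recorder can capture only one signal at a time, so re-routing an output simply overwrites its previous entry.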
Initial recordings consist of video, stereo audio and, in the case of English sources, closed captions. LDC collects both audio and video data for each recording so that this material can be reused for a variety of research purposes and because having access to the video portion of a given broadcast aids in troubleshooting system functions and makes auditing more reliable, more efficient and less error-prone. Recordings are typically transcoded to MPEG-4/AVC at 1 Mbps shortly after capture. The collection system is illustrated in the block diagram below:

Figure 1: LDC Broadcast Collection System Block Diagram

[3] “DV25” refers to IEC Digital Video at 25 Mbps, SMPTE-314M, IEC-61834.

2.1 Collection Database

The collection schedule is stored in a relational database using a MySQL database server. The database contains a history of all of the recordings that have been made; it has configuration and status information for all recorders; it has information about all receivers and associates specific programs of interest with the appropriate receiver; it contains a schedule of all recording jobs that need to be executed and their status; and it stores all audit judgments associated with a given recording.

The collection database consists of the following tables:

bn_source: A “source” refers to a specific content creator associated with a specific reception mode. For example, “CNN Headline News” received via DirecTV receiver 01 would constitute a specific source. Each source has an associated input identifier, which refers to an input port in the routing context of the system.
bn_program: This belongs to a source, has a language and a typical day/time of airing.
bn_recdev: A hardware resource, identified by IP address and device ID. It has a specific output identifier which refers to an output port in the routing context of the system.
bn_sched: Associates a recorder with a program.
bn_recordings: Each recording has an entry in this table with creation date and time, associated filenames with MD5 checksums, auditing summary information and other metadata.
bn_audit: Human judgments about the content and signal quality for multiple subsections of each recording are stored in this table. Auditors review recordings for correct language, genre and program and also note technical problems and degraded signal conditions.

2.2 Computer Hardware Schema

The broadcast collection system is organized as a mini-cluster of Linux computers with a master node, a fileserver, eight client recording nodes and three transcoding nodes. The master node runs the database server and dispatches recording jobs according to the defined schedule. The client nodes are functionally equivalent. Each node has two Canopus ACEDVio digitizers which appear as FireWire interfaces to the node’s operating system.

The master scheduler on the control computer is called bn_run_now. This script actively monitors the database for any scheduled jobs, reserves recording devices and sends job command strings to the appropriate listen ports on client recording nodes. xinetd runs on the client nodes and waits for connections from the scheduler, then starts an instance of bncap.pl. This script runs on every recording node and, upon a signal from the scheduler, begins raw video capture with the dvgrab utility. Once video capture is complete, bncap.pl extracts and downsamples the audio channels into separate A and B channels with SoX and uses mencoder to convert the video into .avi format. The files are uploaded to the collections server, after which bncap.pl connects to the scheduler database and resets the program and recording device values to inactive. In addition, the script inserts recording information into the bn_recordings table.
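The dispatch cycle described above (find due jobs, reserve a recorder, send a command, release the recorder afterwards) might look roughly like the following Python sketch. It is not LDC’s bn_run_now or bncap.pl: an in-memory SQLite database stands in for the MySQL server, every column name and the command string format are hypothetical, the `send` callback stands in for the network connection to a client node, and picking any idle recorder is a simplifying assumption.

```python
import sqlite3

# Hypothetical, heavily simplified versions of two of the tables named in the text.
SCHEMA = """
CREATE TABLE bn_recdev (id INTEGER PRIMARY KEY, ip TEXT, device_id TEXT,
                        active INTEGER DEFAULT 0);
CREATE TABLE bn_sched  (id INTEGER PRIMARY KEY, program TEXT, start_ts INTEGER,
                        duration_s INTEGER, recdev_id INTEGER,
                        status TEXT DEFAULT 'pending');
"""


def dispatch_due_jobs(conn, now_ts, send):
    """Reserve an idle recorder for each due job and pass a command string to `send`."""
    dispatched = []
    due = conn.execute(
        "SELECT id, program, duration_s FROM bn_sched "
        "WHERE status = 'pending' AND start_ts <= ? ORDER BY start_ts, id",
        (now_ts,),
    ).fetchall()
    for job_id, program, duration_s in due:
        dev = conn.execute(
            "SELECT id, ip, device_id FROM bn_recdev "
            "WHERE active = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if dev is None:
            break  # no free recorder; remaining jobs stay pending
        dev_id, ip, device_id = dev
        conn.execute("UPDATE bn_recdev SET active = 1 WHERE id = ?", (dev_id,))
        conn.execute(
            "UPDATE bn_sched SET status = 'active', recdev_id = ? WHERE id = ?",
            (dev_id, job_id),
        )
        send(ip, f"{device_id} {program} {duration_s}")  # invented command format
        dispatched.append(job_id)
    conn.commit()
    return dispatched


def finish_job(conn, job_id):
    """Mark a job done and release its recorder, mirroring the reset step in the text."""
    (dev_id,) = conn.execute(
        "SELECT recdev_id FROM bn_sched WHERE id = ?", (job_id,)
    ).fetchone()
    conn.execute("UPDATE bn_recdev SET active = 0 WHERE id = ?", (dev_id,))
    conn.execute("UPDATE bn_sched SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO bn_recdev (ip, device_id) VALUES ('10.0.0.1', 'dv0')")
conn.execute(
    "INSERT INTO bn_sched (program, start_ts, duration_s) VALUES ('news', 100, 1800)"
)
sent = []
first = dispatch_due_jobs(conn, 100, lambda ip, cmd: sent.append((ip, cmd)))
finish_job(conn, first[0])
```

Keeping the recorder's busy flag in the database, rather than in scheduler memory, means a restarted scheduler can still see which devices are reserved.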
The system’s hardware configuration is as follows:

Control Computer x1 (thalia)
File Server/Compute Server x1 (kronus)
Multiprocessor AMD Opteron Servers (Tyan Transport)
7TB Storage running Ubuntu Linux